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Abstract 

Much of the human cortex seems to be organized into topographic cortical maps. Yet few quantitative 
methods exist for characterizing these maps. To address this issue we developed a modeling framework 
that can reveal group-level cortical maps based on neuroimaging data. PrAGMATiC, a probabilistic 
and generative model of areas tiling the cortex, is a hierarchical Bayesian generative model of cortical 
maps. This model assumes that the cortical map in each individual subject is a sample from a single 
underlying probability distribution. Learning the parameters of this distribution reveals the properties 
of a cortical map that are common across a group of subjects while avoiding the potentially lossy step 
of co-registering each subject into a group anatomical space. In this report we give a mathematical 
description of PrAGMATiC, describe approximations that make it practical to use, show preliminary 
results from its application to a real dataset, and describe a number of possible future extensions. 

In the best understood regions of the human neocortex, functional representations appear to be organized 
into topographic cortical maps: retinotopic and semantic maps in visual cortex [inma, tonotopic maps in 
auditory cortex m. somatotopic maps in somatomotor cortex and numerosity maps in parietal cortex 
[lOj . It seems likely that other regions of the brain-such as prefrontal cortex-are also organized into cortical 
maps, but those maps have yet to be discovered. One major obstacle for defining cortical maps is that 
both anatomy and functional representation can differ substantially between individuals, thwarting current 
methods for combining data across subjects [HIHIIII]. 

The primary question in combining data across subjects is that of correspondence: how do we determine 
which response in subject A corresponds to a given response in subject B? One possibility is by using anatomy. 
Many existing methods align anatomical images of brains from multiple subjects in either volumetric space 
m or surface space [3]. These methods assume that brain anatomy and function are perfectly correlated, 
and that any deviation from this relationship is noise. Yet this assumption has repeatedly been proven false 
OIH]. Indeed, even low-level functional areas such as VI can vary significantly in size across individuals [2]. 

Another way to find correspondence between subjects is by using functional responses. Some new methods 
find corresponding temporal patterns of functional activation in each subject given responses to a complex 
natural stimulus, such as a movie [nig. These methods have proven much more successful at predicting 
functional data in new subjects than anatomical methods (and may in fact be near-optimal for mapping data 
from one subject to another). Working purely in functional space makes it possible to disregard anatomical 
differences between subjects, but it also means that these methods provide little information about the 
organization of cortical maps (indeed they would work just as well even if cortical maps were completely 
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different in each subject). We might be able to learn more about how functional representations on the 
cortex are organized if we can take both function and anatomy into account. 

A third way to find correspondence between subjects is the region-of-interest (ROI) analysis [19], which 
combines both anatomical and functional constraints. In this method, an ROI is defined in each subject 
using a functional localizer contrast. Among voxels that respond significantly to the localizer a single cluster 
is selected as the ROI based on its approximate anatomical location. Then on a different dataset, average 
functional responses within that ROI are compared across subjects. This method does not assume exact 
anatomical correspondence across subjects. However, for many areas of the brain there is no known localizer, 
making it impossible to apply the ROI method. 

With the goal of combining the strengths of purely functional and ROI-based methods we have developed 
PrAGMATiC: a probabilistic and generative model of areas tiling the cortex. This hierarchical Bayesian 
modeling framework accounts for individual differences both in anatomy and function by treating the cortical 
map observed in each subject as a sample from a single underlying probability distribution. This distribution 
consists of two parts: an arrangement model and an emission model. The arrangement model uses a novel 
spring-based approach to describe cortical topography. This flexible approach can account for substantial 
individual differences in the shape, size, and anatomical location of functional brain areas. The emission 
model then generates predicted functional cortical maps based on the map arrangement. 

In this paper we describe the mathematical formulation of this model and show how the various parameters 
can be learned using a Markov chain Monte Carlo technique similar to contrastive divergence m- We then 
show some results from applying PrAGMATiC to a simple example dataset, and describe some possible 
future extensions to the model. 


1 Model formulation 

In the basic instantiation of PrAGMATiC we suppose that the cortex of each subject is tiled with convex 
areas, and that all locations within each area have the same functional properties. This model is shown 
schematically in Figure 1. The location of each area is determined by the location of its centroid, which is 
a single point on the cortical surface. Centroid locations depend on the locations of neighboring centroids 
as well as the locations of some known cortical landmarks, which are identified separately in each subject. 
The functional properties of each area are represented as a vector. The values in this vector could be raw 
BOLD responses, weights from a GLM or voxelwise model, or any derived measure, such as projections of 
voxelwise model weights onto a set of orthogonal components [T3|. The mean functional value for area i is 
denoted Mi. 

The model for each subject is instantiated as a two-layer Bayesian network, with one visible layer and one 
hidden layer. The visible layer units are vertices on the cortical surface mesh. Each vertex is associated 
with a Z?-vector of observed functional values. The vector of observed values for visible unit I in subject s 
is called G TZ^, and the collection of all visible units in a subject is called We assume that all 

visible units are independent of each other given the hidden layer units. 

The hidden layer units are the locations of the area centroids. The location of centroid i in subject s is called 
his, and the collection of all hidden units in subject s is called Hs- 

As a generative model, PrAGMATiC must be able to generate samples from the distribution of visible unit 
vectors. To sample vis we first find the index of the nearest area centroid on the cortical surface, c{H, I, s). 
Then we look up the mean associated with that centroid, Mc[h.i,s)- Finally we draw a sample from a 
multivariate Gaussian distribution with spherical variance: 

Vis ^ M {Mc(^h,i,s)^^vId) ( 1 ) 

The probability distribution over locations of the hidden units is modeled using a physical analogy to a 
system of springs. The ideal length of the spring connecting units i and j is called Lij, and the spring 
constant is Kij. 
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Figure 1: PrAGMATiC model schematic. In PrAGMATiC the arrangement of areas in a cortical map is 
modeled as a physical system of springs, where each pair of area centroids is bound together by a spring. To generate 
a map, each point on the cortex is assigned the functional mean of the closest area centroid. Probability distributions 
over arrangements and maps are formed using Boltzmann distributions. The parameters of these distributions are 
constrained to be common across subjects, but the exact centroid locations are allowed to vary. Parameters are 
learned using maximum likelihood estimation. 
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The full probability distribution for PrAGMATiC is written as the product of the probability of the ar¬ 
rangement, P{H; L, K), and the probability of the visible map given the arrangement, P{V\H; M). Here we 
write these two probability distributions as Boltzmann distributions with energy functions E{H] L, K) and 
E(y\H;M). 

P{V,H;M,L,K) = PiH;L,K)PiV\H;M) (2) 



-^-E(V\H-,M)- 

[ Zh{L,K) \ 

_1 


where Zh{L,K) = E{h-,L,K) e normalizing constants. 

The distribution over arrangements of the hidden layer units is modeled using a Boltzmann distribution with 
the following energy function: 

E{H; L,K) = ^Y. (4) 

Here Kij is the spring constant for the spring linking centroids i and j, Lij is the ideal length of the spring 
linking i and j, and dijs{H) is the geodesic distance across the cortex between centroids i and j in subject s 
and in the arrangement H. We compute these geodesic distances using a heat-based approximation to the 
exact geodesic distance across the cortical surface mesh [5]. This energy function, E{H; L, K), is exactly the 
sum over the spring potential energy for all spring connections in the model. The constant /3 is the inverse 
temperature of the spring system: if /3 is large, then the system has low temperature, and the springs will not 
deviate much from their ideal lengths; if /3 is small, then the system has high temperature, and the springs 
will deviate a lot from their ideal lengths. The normalizing constant for P{H-, L, K) depends on both the 
values of L and K, and is written here as Zh{L,K). 

The distribution over visible unit values is multivariate Gaussian with equal variance in all dimensions and 
zero covariance, but for consistency we write it as an energy-based model. The energy function for the visible 
units is: 

-2 

EiV\H;M) = ^Y.h^-^-(H,i,s)\\l (5) 

l,S 

Here vis is the functional value for visible unit I in subject s, and is the mean functional value 

for the closest hidden layer unit (by geodesic distance across the cortical surface) in the arrangement El to 
visible layer unit I in subject s. The constant uy is the standard deviation of the gaussian, which is assumed 
to be equal in all dimensions. The normalizing constant for P{V\E[]M) depends on cry, but not on any 
of the learned parameters (because this is a Gaussian distribution its normalizing constant is known). It is 
written here as Zy ■ 

1.1 Maximum likelihood estimation 

We use maximum likelihood estimation (MLE) to learn the spring lengths, L, the spring constants, AT, and 
the area functional means, M, based on observed visible unit data, 1/°^®. Here we show the derivation of 
the MLE learning rules from the model formulations given above. 


Spring lengths. The average log likelihood of the spring lengths, L, given the observed data is written: 

E£{L-V°^^) = l^logP(E;'®;L,il,M) (6) 

S 

where N is the number of subjects, s is an index across subjects, and the total probability of the observed 
data given the parameters, P{V°^^; L, K, M) is equal to the expectation of the full probability distribution 
over El: 

P(E/''®; L, K,M) = Y, M)P{h- L, K) (7) 

h 
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To fit our model, we want to increase the average likelihood by changing the spring lengths, L. We can do 
this by differentiating the likelihood function with respect to L, and then changing L by a small amount in 
the direction that will increase the likelihood. This process will be repeated many times. 


Differentiating the likelihood with respect to L, we find: 

fdZH{L,K)Y\ 

-5L- = dL ) 

1 E/. PY^^\h; M)P{h- L, K) 9Eih,L,K) 

NY P(yY\L,K,M) 


The first part of the gradient, which involves the normalizing constant ZH^ can be written as: 

-1 


fdZH{L,K) 

dL 


s.h 


dL 


(9) 


or simply as the expectation of the gradient over H: 

dZH{L,K)\-\ ldE{H-L,K) 

dL ) = - - - 

Thus we note that the entire gradient could be written more simply as the Boltzmann machine learning rule 
from Ackley, Hinton, and Sejnowski, 1985 [T]: 



d^{L;VY YE{H;L,K) \ / dE{H-L,K) \ 

dL \ dL / ^ \ dL / jj\^yobs 


( 11 ) 


However, we retain the earlier formulation in order to make an essential approximation. This gradient is 
impossible to compute exactly because it requires taking the expectation over all possible H. Even though 
H is discrete, there are far too many possible states to reasonably check. Therefore we approximate the 
gradient using only a small number of samples from P{H; L, K). These are obtained using Gibbs sampling, 
wherein the location of each hidden unit is updated sequentially according to the conditional distribution 
P{his\H\is', L, K), where H\is contains the locations of all the hidden units except for unit i in subject s. 
This procedure is used to obtain J samples from P{H\L,K), which are denoted with j = 1...J. 

This approximation is necessary because it is difficult to compute the expected gradient conditioned on the 
visible units. In most energy-based models there is an analytic expression for the conditional distribution of 
the hidden units given the visible units, P{H\V), and thus it is easy to estimate the conditional expected 
gradient. Here it is very expensive to sample from the conditional distribution P{E[\V), but it is cheap to 
sample from P{H\ L, K). 

We use the samples from P{H] L, K) to approximate the expectations over H. This allows us to rewrite the 
gradient function as: 

dYL-,VY ^ 1 ^ dE{Hi-,L,K) 1 

dL js 


This formula shows that the likelihood gradient for L is equal to the difference between the average energy 
gradient across all J arrangement samples (the first term), and a weighted average energy gradient (the 
second term) where the weights are proportional to the probability of the observed data given the 
sampled H. Intuitively, the second term increases the probability of the observed data, while the first term 
decreases the probability of unobserved data (i.e. random samples from the model). 

To compute the energy gradient for each arrangement sample we differentiate the energy function with 
respect to each element of L, giving: 

^E iY - dYH)) (13) 

S 
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Spring constants. Similarly, the gradient of the likelihood with respect to the spring constants, K is 
derived as: 


d^{K] V°^^) 

Wk 


1 dE{Hi -L^K) I ^ Ej P{Vs''^\Hi ; M) 


And the energy gradient is: 


dE{H-L,K) 


S 


(14) 


(15) 


Area functional means. The gradient for M is slightly different because the normalizing constant Zv 
does not depend on M (as the normalizing constant of a Gaussian does not depend on the mean). Thus it 
has a simpler expression: 




/V 2 ^ 


dM 


dM N ^ Y.J ; M) 

And the energy gradient for the mean of area i, dEy/dMi is: 


dE{V\H;M,) _2 
- = a 


dM, 


y {Vu- M,) 


(16) 


(17) 


where the sum is taken only over the visible units /, s for which the closest hidden unit in the arrangement, 
c{H, I, s) is i. 


1.1.1 Learning using MLE 

To obtain high quality, independent samples of H we maintain J parallel Markov chains for each of the N 
subjects. At each learning step we perform one Gibbs sweep through each of the Markov chains. That is, at 
step t in chain j and subject s we draw the sample: 

~ P{H,\Hl’^-^-,L\K^) (18) 

For each of the J samples we compute the energy gradients for L, K, and M, as well as the likelihood of the 
observed data P(V(°^®|i7|’*; M*). Then we compute the average gradients (for L and K) and the weighted 
average gradient according to the data likelihoods. Finally we update L, K, and M by taking a small step 
down these gradients: 




Rt + i 






K^-ck 


M* — Cm 


dl? 


dK* 

aM* 


(19) 

( 20 ) 
( 21 ) 


This process is then repeated for many thousands of steps. Although each step is stochastic (due to the 
finite number of samples, J, that was used to approximate the gradients), the likelihood function will tend 
to increase over time. 
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1.2 Hyperparameters 


The hyperparameters (3 and cry affect learning speed but do not directly affect the learned parameters 
(except by virtue of poor approximation). The inverse spring temperature, /?, determines how stiff or loose 
the springs are. If /? is very high, then the springs will be very stiff, samples of the arrangement H will be 
highly correlated across iterations, and the quality of the gradient steps will suffer. If j3 is very low, then 
the springs will be very floppy, samples of the arrangement H will be highly random, and the quality of the 
gradient steps will also suffer. 

If the standard deviation of the visible unit Gaussian distribution, cry is very low, then one sample from the 
arrangement H will always yield much higher likelihood of V°^‘^ than the others, and the weighted average 
of the gradients across samples will become the difference between the best sample and the other samples. 
If cry is very high, then the likelihood of will be very similar for all samples, and the weighted average 
of the gradients will be almost identical to the simple average across gradients, making learning slow. 

Note further that these hyperparameters interact with each other. If the inverse temperature f3 is high, 
then almost all samples of the arrangement H will be close to the H that minimizes the total spring energy. 
Because the samples will be more similar, the likelihoods will also be more similar, and the weighted average 
of the gradients will again be similar to the simple average. The hyperparameters also interact with the 
number of areas in the model. 

Rather than tuning these parameters directly, we select desired levels of entropy for the distribution over 
each centroid given the other centroids Ph = P{his\H\is-, L, K) and the distribution over observed visible 
unit values given the sampled arrangements Py = P{V°^^\Hi]M). If the average entropy of the centroid 
distribution Ph is lower than the target value, we then lower the inverse temperature /? to make the springs 
more floppy, and if it is higher than the target value we raise j3 to make the springs stiffer. If the entropy 
of the visible unit distribution Py is lower than the target value, we raise the standard deviation cry to give 
higher probability to the less optimal states, and if it is higher than the target value we lower oy to give 
lower probability to the less optimal states. We check the entropies at the end of each learning iteration. 
Then we adjust /? and cry by 10% upward or downward depending on whether the entropies are too high or 
too low. 

High entropy keeps the model from falling into local minima, but also keeps the model from finding very 
high likelihood solutions. Conversely, low entropy allows the model to find high likelihood solutions, but also 
makes it more likely to fall into local minima. To take advantage of both low and high entropy states we use 
an annealing approach, where the entropy target for the centroid distribution Ph is high at the beginning of 
learning, but then is gradually lowered throughout the learning process. This makes the Markov chain take 
larger, more uncorrelated steps at the beginning of learning, but smaller steps at the end. 


1.3 Initialization 

Proper initialization of the PrAGMATiC parameters and hidden variables is essential for successful model 
fitting. In particular, the area centroids need to be initialized in each subject so that the overall topology of 
the spring system is preserved. If the centroids are randomly placed in each subject it is highly likely that 
there will be topological errors, where the entire map or portions thereof will be mirrored in some subjects 
relative to other subjects. Topological errors represent large local minima in the spring energy function, and 
are very difficult for the sampler to overcome without running at high temperature for a long time. 

To avoid this problem we construct pseudorandom arrangements where the topology is nearly identical in 
every subject. First we initialize the centroids in one subject. This can be done randomly or pseudoran- 
domly. Then we compute the geodesic distance across the cortex from each randomly-placed centroid to 
each landmark. To initialize centroids in other subjects we then find the centroid locations that minimize the 
total squared difference in centroid-landmark distance from the distances we measured in the first subject. 
This process guarantees that the topology of the spring system is the same in every subject. 
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Next we decide which springs, out of all possible centroid-centroid pairs, will be included in the model. This 
is done using a process similar to Delaunay triangulation. We use the centroids to construct a Voronoi 
tessellation of the cortex in each subject by assigning every vertex on the surface to the nearest centroid. 
Then we find each pair of centroids that share a border in the Voronoi tessellation. This is done by checking 
whether each pair of neighboring vertices in the cortical surface belong to different centroids. Because the 
Voronoi tessellation is dual to the Delaunay triangulation, we can then construct the Delaunay triangulation 
by simply connecting each pair of neighboring centroids. Only these centroid-centroid springs are then 
considered in the spring model. All possible landmark-centroid springs are also included. 

Based on the initial centroid placements we initialize the spring lengths, L, to be the average spring lengths 
across all the subjects. The spring constants, K, are set to 1 for all springs that are included in the model. 
To initialize the area functional means, M, we compute the mean functional values for each area in the 
parcellation given by the initial centroid placements. 

I. 4 Improvements to stability 

In practice the algorithm as written above converges when the numbers of areas and subjects are both small. 
If these numbers become large, however, (e.g. 128-1- clusters and 5-1- subjects) the algorithm becomes less 
stable. When this occurs, the model tends to prioritize minimum energy solutions over maximum probability 
solutions. That is, the model tries to minimize the total spring energy across all the subjects at the cost 
of poorly explaining the data. This often causes all the centroids to cluster together as far as possible from 
any known landmarks. This type of solution minimizes the energy of the spring system across subjects, but 
does not maximize the probability of the observed data. These effects are exacerbated when /? is high. 

We believe that this problem is caused by bias in drawing samples of the arrangement H. When the spring 
temperature is low, all samples from the distribution over arrangements P{H] L, K) will be very close to the 
minimum spring energy state (i.e. the arrangement H~ that minimizes the total spring energy E{H~; L, K)). 
If the minimum spring energy state is far from the maximum probability state (i.e. the arrangement H* 
that maximizes the probability of the observed data P{V^°^^\H*)), then this could bias the gradient steps 
such that the likelihood of the observed data decreases over time, although the spring energy also decreases. 

One solution to this problem is to increase J, the number of samples that are drawn for each subject at each 

step of the learning algorithm. However, this is expensive, as run time of the learning algorithm is linear in 

J. An alternative solution that incurs almost no additional cost is to add a small perturbation to the ideal 
spring lengths that is tailored to each subject. That is, we replace the global spring length parameters L 
with L -I- As, where As is a subject-specihc length parameter for subject s that is also learned using MLE. 
With this change, the domain of possible minimum spring energy states is much larger, and thus it is much 
more likely that the maximum probability state also has low or minimum spring energy. To ensure that 
As stays small and that the average perturbation across subjects stays close to zero, we add a quadratic 
regularization penalty. 

With this new term, the full probability distribution is now written: 

P{V,H\A-,M,L,K) = P{H\A;L,K)P{V\H;M)P{A) (22) 

= (23) 

where P{A) is a zero-mean Gaussian prior on A. 

The spring energy function becomes: 

E{H\A-, L,K) = ^Y1 (drjsiH) - U, - (24) 

i,3,s 
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and the energy function for A is: 


( 25 ) 



(26) 


The spring energy gradient for A is very similar to that for L\ 


dE{H\K-L,K) 


i^Lij + dijs{Ey) 


And the energy gradient for the Gaussian prior on A is simple: 


dE{A) 


dA, 


— Aijs 


(27) 


IJS 


Because A incorporates a prior, it has a slightly different likelihood gradient than L: 


a^(A; G°^") _ / dE{H; L, K,A)\ / dEiH; L, K, A) 

5A “\ dA /h~\ ^ 




dE{A) 

dA 


(28) 


One issue with this solution is that it changes the meaning of the spring constants, K. If the total subject- 
specific spring lengths are exactly equal to the distances between centroids (i.e. dijs(H) = Lij -\- Aijg for all 
i, j, and s), then the spring constants cannot be learned because the gradient for K becomes zero. It might 
be possible to simply fix K and learn A, then later set K to be the variance of A over subjects. In any case, 
it is probably unwise to learn K and A simultaneously. 


1.5 Improvements to efficiency 

Even with the gradient approximation, this model can be prohibitively expensive to fit. The computational 
complexity of the parameter estimation comes mostly from two operations: drawing samples from the distri¬ 
bution over arrangements P{H) using Gibbs sampling, and computing the likelihood of the observed visible 
unit data P{V°'^^\E[). Of these two steps sampling P{H) is by far the more expensive. However, we can 
make further approximations that simplify this process significantly. 

In the Gibbs sampler we update the location of each centroid, his, based on the locations of the other 
centroids, H\is- This is done by finding the total spring energy given each possible location of the selected 
centroid, summing these energies to find the normalizing constant, and then sampling from the resulting 
multinomial distribution. That is, for each possible location his = I, we compute the total spring energy: 

Eis{his = l) = {dijs{his = l,H\is) - Lij)^ (29) 

3 

Then to find the conditional probability P{his = l*\H\is) we divide by the sum of the energies: 

P{h,s=nH^.s) = ^^i\^^*\, (30) 

2^1 E,s[his = 1) 

To compute this distribution exactly we would need to find the total spring energy for every possible location I 
in the entire cortex, for every centroid, on every sweep of the Gibbs sampler. Even if we precache the geodesic 
distances between every pair of centroids (which would take dozens of gigabytes of memory) this operation 
is orders of magnitude more expensive than any other step in the parameter estimation. 

We can reduce both the computational and memory demands of this step by only allowing the centroids to 
occupy a subset of the locations on the cortical surface. The cortical surfaces that we use have approximately 
150,000 vertices in each hemisphere. We chose 5,000 random vertices in each hemisphere to serve as possible 
centroid locations. This immediately reduced both the computational and memory demands (of precaching 
the geodesic distances) by a factor of 30. Additionally, we exclude any potential centroids that are more 
than 150mm from the original centroid location. In practice these approximations result in slightly reduced 
fidelity of the reconstructed maps but a massive increase in performance. In the future we might be able to 
get around even these limitations by first doing a fast and coarse model fit and then following up by running 
a few iterations with less (or no) subsampling to fine-tune the parameters. 
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2 Running PrAGMATiC on an example dataset 


To test PrAGMATiC we applied it to a motor localizer dataset. This dataset consisted of BOLD fMRI data 
collected from 12 subjects while they underwent a 10-minute motor localizer. During the scan subjects were 
occasionally prompted to wiggle their fingers, wiggle their toes, move their mouths, generate internal speech, 
or generate frequent saccades. Each condition occurred in five 20-second blocks throughout the 10-minute 
scan. Standard fMRI preprocessing was applied, and then a linear model was used to find the change in 
BOLD response of each voxel in each condition relative to the mean BOLD response. The linear model 
weights for each condition were then projected onto cortical surface models that were specially constructed 
for each subject. The five resulting maps (one each for the foot, hand, mouth, speak, and saccade conditions) 
were then used as input data for PrAGMATiC. 

We fit a PrAGMATiC model with 384 centroids using data from the left cortical hemisphere of 11 of the 
12 subjects. For this model we fit the subject-specific edge lengths, A, and did not fit spring constants, K. 
The model fitting ran for 4000 iterations. On each iteration we generated J = 5 samples from P{H; L, K) 
for each subject. The target entropy for the visible unit distribution Py was set to 0.2 (i.e. 20% of the 
maximum possible entropy). The target entropy for the centroid distribution P^ was set to 0.33 at the first 
iteration and then slowly lowered to 0.2 over the course of learning. The standard deviation of the visible 
unit emission model, cry, was initially set to 16.0 and then was updated according to the entropy of the 
visible unit distribution. The inverse spring temperature, j3, was initially set to 0.01 and then was updated 
according to the entropy of the centroid distribution. The learning rate for the spring lengths, cl, was set 
to 20.0. The learning rate for the area functional means, cm, was set to 0.005. The learning rate for the 
subject-specific spring lengths, ca, was set to 1.0. The standard deviation of the Gaussian prior on the 
subject-specific spring lengths, ca, was set to 25.0. 

This model used seven anatomical landmarks to ensure that the maps approximately matched across subjects. 
These landmarks were specified manually in each subjects as vertex locations on the cortical surface. These 
landmarks were the top and bottom of the central sulcus, the front (genu) and back (splenium) of the corpus 
callosum, the frontal pole, the occipital pole, and the temporal pole. 

Some results from this model are shown in Figure 2. In Figure 2A we show how the spring lengths evolved 
for springs connecting one particular landmark (the bottom of the central sulcus) to all the centroids. The 
line for each spring is shaded according to its total change over the course of learning, so that darker lines 
correspond to springs that changed more. Here we see that many of the spring lengths changed relatively 
monotonically over the course of learning, while others moved both up and down. 

In Figure 2B we show how the area functional means evolved for one particular functional dimension corre¬ 
sponding to hand movement in the motor localizer. Similar to Figure 2A, the line for each area is shaded 
according to its total change over the course of learning. Here we see that a few areas seem to become much 
more selective over the course of learning, while many are unchanged. In contrast to the spring lengths where 
the changes appeared mostly monotonic, the area means seem to stabilize after a few thousand iterations. 

In Figure 2C we show how the likelihoods of the observed data given the model evolved. The grey lines 
indicate individual subjects and the red line shows the mean. Likelihoods were computed at each iteration 
as the mean log likelihood across all the visible units given the model state. To ensure that likelihoods are 
comparable even though the visible unit standard deviation, uy, is changing during learning, we computed 
all likelihoods with cry = 1. The likelihood plot for each subject is shown relative to their likelihood at 
iteration 0. Here we see that the likelihood increases monotonically throughout learning, confirming that 
the learning rules we derived are having their intended effect. 

In Figure 2D we show predicted maps for the hand movement dimension in a region around the central 
sulcus at the first iteration and the final iteration. White outlines show the locations of manually defined 
regions including the known hand-responsive primary motor hand area (MIH) and primary somatosensory 
hand area (SIH). Here we see that hand-responsive areas in the initial map are not closely matched to the 
manual definitions. In the final map, however, boundaries of the highly hand-responsive areas closely match 
the manually drawn boundaries. This shows that PrAGMATiG is effectively learning the structure of the 
cortical map. 
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Figure 2: PrAGMATiC results on an example motor localizer dataset. (A) Evolution of spring lengths 
over the conrse of learning for springs between one particular landmark (bottom of the central sulcus) and each area. 
Darker lines show spring lengths that changed more. (B) Evolution of area functional means over the course of 
learning for the functional dimension corresponding to hand movement. Darker lines show functional means that 
changed more. (C) Evolution of the mean log likelihood of the observed data given the model over the course of 
learning. Gray lines show individual subjects, red line shows the mean. Zero is defined as the likelihood at the first 
iteration. (D) Flattened cortical maps showing initial and final predicted hand response maps for a region around 
the central sulcus. Darker red means higher predicted hand response. White outlines show manually defined regions 
such as the primary motor hand area (MIH) and primary somatosensory hand area (SIH). The hnal map matches 
the manually defined areas much more closely than the initial map. 


These results show that the PrAGMATiC learning rules derived here can be used to effectively learn a 
generative model of cortical maps. Future work on this dataset will be directed at comparing PrAGMATiC 
to other methods of cross-subject mapping. 


3 Future Directions 

Because PrAGMATiC is described as a hierarchical Bayesian network it is easy to extend in various ways. 
Here we describe some possible future extensions to PrAGMATiC that we believe will prove highly useful 
for neuroimaging research. 


Extended hierarchical models. One interesting future use case of PrAGMATiC will be comparing cor¬ 
tical maps between groups of subjects (e.g. patient groups vs. controls) or between different conditions (e.g. 


11 







































different attentional states) within the same subjects. This can be done by extending the PrAGMATiC hier¬ 
archical model to include group-specific area functional means, spring lengths, or spring constants. Standard 
Bayesian model comparison techniques could then be used to test specific hypotheses about the differences 
between cortical maps. For example, we could test whether two groups of subjects differ only in area func¬ 
tional means and not spring lengths, or vice versa. This type of cortical map comparison is not possible 
using any current techniques. 


Generic map generating functions. One obvious extension is using map generating functions other 
than the Voronoi tessellation detailed above. That is, in the energy function for the visible units, we can 
more generally write: 

E{V\H-M) = ^J2 (31) 

l,S 

where f{H,M) is a function that takes an arrangement, H, and functional means, M, and converts them 
into a vector over all visible units in each subject. The only requirement that we have for this function is 
that it must be differentiable with respect to M so that we can define: 


dE{V\H;M,) 

dM, 


l,S 


df{H,M)is 

dM, 


(32) 


For example, this function could take the form of a smooth spline interpolation between the area centroids. 
Each possible map generating function could be thought of as a hypothesis about how cortical maps are 
organized. Then standard Bayesian model comparison techniques could be used to compare these hypotheses. 
This would enable us to test, for example, whether homogeneous functional areas (as used here) or smooth 
gradients better describe certain cortical maps. 


Covariance-based models. The model described here assumes that functional values are directly compa¬ 
rable across subjects. In resting state functional connectivity experiments there is no first-order relationship 
between functional values across subjects, but the covariance of functional responses across areas seems to 
be conserved [1]. PrAGMATiC could potentially be extended to capture these types of relationships by 
modeling the covariance matrix of visible unit responses. 

That is, we could assume that the covariance of T measured functional values (e.g. a length T resting 
state timecourse) across the Ny total visible units within a subject is drawn from a Wishart distribution: 

~ Wnv (^) T). Here E is an Ny x Ny symmetric matrix. Instead of learning parameters that describe 
the mean functional values within each area, M, we would learn parameters that describe the functional 
covariance between areas, such that 17^ is the average covariance between visible units in area i and 
visible units in area j. We define 'Ekls^ which is the model’s best guess at the covariance between visible 
units k and I in subject s, as Eus = ^ciH,k,s),c{H,i,s)- 

Then we could learn the covariance parameters, fl, the spring lengths, L, and the spring constants, AT, so 
as to maximize the mean log likelihood of the observed covariances across subjects. This covariance-based 
formulation would allow us to learn PrAGMATiG parcellations using only resting state data. 


Direct area-wise model fitting. In early applications of PrAGMATiC we have assumed that the func¬ 
tional values for each visible unit capture functional selectivity. For example, these values might be derived 
from low dimensional projections of voxel-wise model weights [13] . Yet it would also be possible to combine 
the voxel-wise model htting step into the PrAGMATiC model, yielding an end-to-end Bayesian model. 

In voxel-wise modeling (VM) a complex natural stimulus is shown to subjects and then timecourses repre¬ 
senting features of interest are extracted from the stimulus. VM htting then uses ridge regression (or another 
regularized linear model) to estimate how the BOLD response in each voxel is inhuenced by each feature. 
Rather than estimating a PrAGMATiC model using the already estimated VM weights, we could directly 
estimate a separate linear model for each PrAGMATiC area. The functional timecourse for each visible unit. 
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vis{t) could be modeled as a multivariate Gaussian distribution: vis{t) ~ Here we define 

/Ti, the mean timecourse for area i, as a linear combination of the feature timecourses, X: = Xfii. During 
model fitting we would learn the linear weights, j3, the spring lengths, L, and the spring constants, K. 


Automatic ROI labeling. PrAGMATiC could also be used to automatically label well-known ROIs, such 
as the fusiform face area (FFA) [T^ and parahippocampal place area (PPA) [7]. In this use case we would 
assume that the area membership of each visible unit is not known, but the functional values have been 
observed. For example, we might estimate a PrAGMATiC model using localizer data from a large number of 
subjects. Then we would assign each area a label, such as “FFA”, based on anatomical location and functional 
selectivity (many areas may be assigned to each ROI). After acquiring localizer data for a new subject, we 
could then label known ROIs by applying the estimated and labeled PrAGMATiC model. That is, we could 
use Gibbs sampling and simulated annealing to optimize the arrangement of the areas in the new subject so 
that the probability of the observed data is maximized, finding iJ* = argmaxj:^P(P^g^|H; L, A, M). Then 
we would define FFA in the new subject as the collection of visible units belonging to the areas that were 
labeled as FFA. This procedure would greatly simplify the process of labeling known ROIs, which is currently 
laborious and manual. 


Representational similarity. One method for analyzing fMRI data that has gained popularity in recent 
years is representational similarity analysis (RSA) [TB] . In this method an ROI is selected (or an anatomical 
searchlight is used) and then the covariance of responses within that ROI are computed across stimuli to 
produce a representational dissimilarity matrix (RDM) for the ROI. These RDMs can then be compared 
across ROIs or between ROIs and models. One issue with this method is that it either requires prior 
definition of an ROI across which the RDM is to be computed, or uses an anatomical searchlight. Instead, 
PrAGMATiC could be used to find areas that have similar representational dissimilarity and anatomical 
location across subjects. 

In this method the functional means would be replaced with mean RDMs for each area. Then on each 
iteration we would compute an RDM for each area, and evaluate the likelihood of the individual RDM given 
the mean RDM. This could use a similar Wishart formulation to the covariance-based model above. 


References 

[1] D. Ackley, G. Hinton, and T. Sejnowski. A learning algorithm for boltzmann machines. Cognitive 
Science, 9(1):147-169, March 1985. ISSN 03640213. doi: 10.1016/S0364-0213(85)80012-4. URL http: 
//doi.Wiley.com/10.1016/30364-0213(85)80012-4, 

[2] K. Amunts, A. Malikovic, H. Mohlberg, T. Schormann, and K. Zilles. Brodmann’s areas 17 and 18 
brought into stereotaxic space-where and how variable? Neuroimage, ll(l):66-84, January 2000. ISSN 
1053-8119. doi: 10.1006/nimg.1999.0516. URL http://www.ncbi.nlm.nih.gov/pubmed/10686118, 

[3] N.Y. Bilenko, T. Cukur, A.G. Huth, S. Nishimoto, and J.L Gallant. FISCA: Functional Inter-Subject 
Component Analysis predicts brain activity to novel natural movies. In preparation. 

[4] R.L. Buckner, J.R. Andrews-Hanna, and D.L. Schacter. The brain’s default network: anatomy, function, 
and relevance to disease. Annals of the New York Academy of Sciences, 1124:1-38, March 2008. ISSN 
0077-8923. doi: 10.1196/annals.1440.Oil. URL http://www.ncbi .nlm.nih.gov/pubmed/18400922 

[5] K. Crane, C. Weischedel, and M. Wardetzky. Geodesics in heat: A new approach to computing distance 
based on heat flow. ACM Transactions on Craphics, 32(3):10, 2013. ISSN 07300301. doi: 10.1145/ 
2516971.2516977. URL http://dl.acm.org/citation.cfm?id=2516977 

[6] F. Crivello, T. Schormann, N. Tzourio-Mazoyer, P.E. Roland, K. Zilles, and B.M. Mazoyer. Comparison 
of spatial normalization procedures and their impact on functional maps. Human brain mapping, 16 


13 



(4):228-50, August 2002. ISSN 1065-9471. doi: 10.1002/hbm. 10047. URL http://www.ncbi.nlm.nih. 
gov/pubmed/12112765 

[7] R Epstein and N Kanwisher. A cortical representation of the local visual environment. Nature, 392 
(6676):598-601, April 1998. ISSN 0028-0836. doi: 10.1038/33402. URL http://www.ncbi.nlm.nih. 
gov/pubmed/9560155 

[8] E. Fedorenko, P.-J. Hsieh, A. Nieto-Castanon, S. Whitfield-Gabrieli, and N. Kanwisher. New method 
for fMRI investigations of language: defining ROIs functionally in individual subjects. Journal of 
neurophysiology, 104(2): 1177-94, August 2010. ISSN 1522-1598. doi: 10.1152/jn.00032.2010. URL 
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2934923, 

[9] B. Fischl, M.I. Sereno, R.B.H. Tootell, and A.M. Dale. High-resolution intersubject averaging and a 
coordinate system for the cortical surface. Human Brain Mapping, 8:272-284, 1999. ISSN 10659471. 
doi: 10.1002/(SICI)1097-0193(1999)8. 

[10] B.M. Harvey, B.P. Klein, N. Petridou, and S.O. Dumoulin. Topographic representation of numerosity 
in the human parietal cortex. Science (New York, N.Y.), 341(6150):1123-6, September 2013. ISSN 
1095-9203. doi: 10.1126/science.1239052. URL http://www.ncbi.nlm.nih.gov/pubmed/24009396 

[11] J.V. Haxby, J.S. Guntupalli, A.C. Connolly, Y.O. Halchenko, B.R. Conroy, M.I. Gobbini, M. Hanke, and 
P.J. Ramadge. A common, high-dimensional model of the representational space in human ventral tem¬ 
poral cortex. Neuron, 72(2):404-16, October 2011. ISSN 1097-4199. URL http://www.pubmedcentral. 
nih.gov/articlerender.fcgi?artid=3201764, 

[12] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 
14(8):1771-800, August 2002. ISSN 0899-7667. doi: 10.1162/089976602760128018. URL http://www. 
ncbi.nlm.nih.gov/pubmed/ 12180402 

[13] A.G. Huth, S. Nishimoto, A.T. Vu, and J.L. Gallant. A continuous semantic space describes the 
representation of thousands of object and action categories across the human brain. Neuron, 76(6): 
1210-24, 2012. ISSN 1097-4199. doi: 10.1016/j.neuron.2012.10.014. URL http://www.ncbi.nlm.nih. 
gov/pubmed/23259955 

[14] M. Jenkinson and S. Smith. A global optimisation method for robust affine registration of brain images. 
Medical image analysis, 5(2):143-56, June 2001. ISSN 1361-8415. URL http://www.ncbi.nlm.nih. 
gov/ pubmed/ 11516708 

[15] N Kanwisher, J McDermott, and M M Chun. The fusiform face area: a module in human extrastriate 
cortex specialized for face perception. The Journal of neuroscience : the official journal of the Society 
for Neuroscience, 17(11):4302-11, June 1997. ISSN 0270-6474. URL http://www.ncbi.nlm.nih.gov/ 
pubmed/9151747 

[16] N. Kriegeskorte, M. Mur, D.A. Ruff, R. Kiani, J. Bodurka, H. Esteky, K. Tanaka, and P.A. Bandettini. 
Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60 
(6):1126-41, December 2008. ISSN 1097-4199. doi: 10.1016/j.neuron.2008.10.043. URL http://www. 
pubmedcentral.nih.gov/articlerender.fcgi?artid=3143574, 

[17] W. Penheld and E. Boldrey. Somatic motor and sensory representation in the cerebral cortex of man 
as studied by electrical stimulation. Brain: A journal of neurology, 1937. URL http://psycnet.apa. 
org/psycinfo/ 1938-05711-001, 

[18] G.L. Romani, S.J. Williamson, and L. Kaufman. Tonotopic organization of the human auditory cortex. 
Science, pages 698-705, 1982. URL http://www.sciencemag.org/content/216/4552/1339.short 

[19] R. Saxe, M. Brett, and N. Kanwisher. Divide and conquer: a defense of functional localizers. Neurolm- 
age, 30(4):1088-96; discussion 1097-9, May 2006. ISSN 1053-8119. doi: 10.1016/j.neuroimage.2005.12. 
062. URL http://www.ncbi.nlm.nih.gov/pubmed/16635578. 


14 








[20] M. Sereno, a. Dale, J. Reppas, K. Kwong, J. Belliveau, T. Brady, B. Rosen, and R. Tootell. Borders of 
multiple visual areas in humans revealed by functional magnetic resonance imaging. Science, 268(5212): 
889-893, May 1995. ISSN 0036-8075. doi: 10.1126/science.7754376. URL http://www.sciencemag. 
org/cgi/doi/10.1126/science.7754376, 


15 



